Many organizations, such as the Census Bureau, hospitals, and search engine companies, wish to altruistically publish unaggregated data about individuals in order to support research. Such data usually contains personal information about those individuals. The challenge is to 'anonymize' the data so that sensitive information about individuals is not disclosed while useful aggregate information is preserved. Done poorly, such altruistic data releases can lead to egregious leaks of personal information, as in the widely publicized AOL data release fiasco of August 2006.
In the first part of this talk, I will motivate the need for formally defining privacy by showing attacks on a very popular anonymization technique called k-Anonymity. I will then present my work on L-Diversity, a formal privacy definition that provably limits privacy breaches against bounded adversaries. In the second part of my talk, I will present some of the challenges I faced in applying formal privacy definitions to a real Census data publishing application called OnTheMap. I will also describe the techniques I developed to combat data sparsity and to ensure that OnTheMap still published useful information.
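To make the contrast concrete, here is a minimal sketch (not from the talk itself) of the simplest "distinct" variant of these two definitions: a table is k-anonymous if every group of records sharing the same quasi-identifiers has at least k members, and distinct l-diverse if every such group also contains at least l different sensitive values. The toy records and attribute values below are purely illustrative.

```python
from collections import defaultdict

# Hypothetical toy records: (quasi-identifiers, sensitive attribute).
# All values here are illustrative, not taken from any real dataset.
records = [
    (("47677", "29"), "Heart Disease"),
    (("47677", "29"), "Heart Disease"),
    (("47677", "29"), "Heart Disease"),
    (("47602", "22"), "Flu"),
    (("47602", "22"), "Cancer"),
    (("47602", "22"), "Flu"),
]

def group_by_quasi_identifiers(rows):
    """Partition records into equivalence classes by quasi-identifiers."""
    groups = defaultdict(list)
    for qi, sensitive in rows:
        groups[qi].append(sensitive)
    return groups

def is_k_anonymous(rows, k):
    # Every equivalence class must contain at least k records.
    return all(len(vals) >= k
               for vals in group_by_quasi_identifiers(rows).values())

def is_l_diverse(rows, l):
    # Every equivalence class must contain at least l DISTINCT
    # sensitive values, blocking the "homogeneity" attack that
    # k-anonymity alone permits.
    return all(len(set(vals)) >= l
               for vals in group_by_quasi_identifiers(rows).values())

print(is_k_anonymous(records, 3))  # True: each class has 3 records
print(is_l_diverse(records, 2))    # False: the first class has a single
                                   # sensitive value, so an adversary who
                                   # links someone to it learns their diagnosis
```

The example shows why k-anonymity can fail even when satisfied: the first equivalence class is 3-anonymous yet every member has "Heart Disease", so membership alone reveals the sensitive value.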
I will conclude by briefly describing a potential application of my work: a privacy-aware platform that would allow web applications to exploit personal data (search and browsing histories, social networks, tags, etc.) to enhance users' web experience while provably guaranteeing their privacy.